Corelab Seminar
2014-2015
Loukas Kavouras (NTUA)
Algorithms for the k-means Clustering Problem
Abstract.
Clustering is the task of grouping a set of objects in such a way that objects
in the same cluster are more similar to each other than to those in other clusters. It is a main task of data
mining, machine learning and computational geometry. In this presentation, we discuss famous clustering problems
and we emphasize on the k-means clustering problem, where one seeks to partition n observations into k clusters
so as to minimize the within-cluster sum of squares. We present Lloyd's algorithm for the k-means problem, which
was identified as one of the top 10 algorithms in data mining. Although Lloyd's algorithm has an exponential
running time in the worst case, it usually runs fast in many practical applications. However, the algorithm
gives no guarantees and there are natural examples where it may produce arbitrarily bad clusterings. k-means++
algorithm addresses this problem by augmenting Lloyd's algorithm with a simple and intuitive seeding technique.
A formal proof shows that k-means++ algorithm is O(log k) competitive. Next, we consider cases where the entire
input is not available from the beginning. That is, we examine an algorithm for k-means in the streaming model,
where the data is too large to be stored in main memory and must be accessed sequentially.